Information Extraction from Web Product Catalogues

Author

  • Martin Labský
Abstract

In this paper we present preliminary results for information extraction (IE) performed over a set of HTML documents using Hidden Markov Models (HMMs). In our experiments, we restrict ourselves to the domain of bike products sold on the Internet. The information to be extracted consists of bike model attributes and details regarding the company’s offer. We experiment with three approaches utilising HMMs and present results in terms of precision and recall.
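
As a rough illustration of how an HMM can assign extraction labels to tokens from a product page, the sketch below runs Viterbi decoding over a toy model. The states (`background`, `name`, `price`), the coarse observation classes, and every probability are invented for illustration; the abstract does not describe the model structure or parameters, so none of this should be read as the paper's actual setup.

```python
import math

# Hypothetical toy model: states, observation classes, and probabilities
# are illustrative assumptions, not values from the paper.
STATES = ["background", "name", "price"]

START = {"background": 0.8, "name": 0.15, "price": 0.05}

TRANS = {
    "background": {"background": 0.7, "name": 0.2, "price": 0.1},
    "name": {"background": 0.3, "name": 0.6, "price": 0.1},
    "price": {"background": 0.8, "name": 0.1, "price": 0.1},
}

EMIT = {
    "background": {"word": 0.75, "capitalized": 0.2, "number": 0.05},
    "name": {"word": 0.2, "capitalized": 0.7, "number": 0.1},
    "price": {"word": 0.05, "capitalized": 0.05, "number": 0.9},
}


def token_class(token):
    """Map a raw token to one of three coarse observation classes."""
    if any(ch.isdigit() for ch in token):
        return "number"
    if token[:1].isupper():
        return "capitalized"
    return "word"


def viterbi(tokens):
    """Return the most probable state sequence for the given tokens."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    # scores[i][s]: best log-probability of any path ending in state s at step i
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []  # back[i][s]: best predecessor of state s at step i + 1
    for o in obs[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("the new Author Kinetic for 299 USD".split())` returns one label per token. Working in log space avoids numerical underflow on longer token sequences; a real system would estimate the probabilities from annotated pages rather than set them by hand.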

Similar resources

Multimedia Information Extraction in Ontology-based Semantic Annotation of Product Catalogues

The demand for efficient methods for extracting knowledge from multimedia content has led to a growing research community investigating the convergence of multimedia and knowledge technologies. In this paper we describe a methodology for extracting multimedia information from product catalogues empowered by the synergetic use and extension of a domain ontology. The methodology was implemented ...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Ontology-based Product Catalogues: An Example Implementation

Electronic Product Catalogues are the basis for offering and selling products in online market places. To be efficient, these catalogues have to provide a semantically precise description of product features to allow for effective matchmaking of products and customer requests. At the same time, the description has to follow a common terminology that allows the integration with the catalogues of...

Multimedia information extraction from HTML product catalogues

We describe a demo application of information extraction from company websites, focusing on bicycle product offers. A statistical approach (Hidden Markov Models) is used in combination with different ways of image classification, including latent semantic analysis of image collections. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in ...

State of the Art and Classification of Electronic Product Catalogues on CD-ROM

With the expansion of services on the World Wide Web (WWW) and the distribution of information on CD-ROM, modern electronic support for the advertising and sale of goods has become a key factor in the marketing strategy of many companies. Information systems which focus their attention on multimedia presentation of products or services with functions that allow searching, selection, ...

Publication date: 2004